This course covers the core themes of data science, such as visualizing, analyzing and interpreting data. It is for everyone who wants to become a better data scientist! I also expect this course to improve my coding skills. My GitHub repository: https://github.com/Pedusal/IODS-project
I have built a model that explores students' exam scores. I found that attitude was the most significant variable in explaining the differences in exam scores. I have also learned how to visualize data and make many useful plots for studying model validity.
This dataset consists of answers to a survey conducted in the course Introduction to Social Statistics in fall 2014. The survey questions concerned the students' learning approaches and were divided into three categories: deep learning, surface learning and strategic learning. In this dataset the answers in each category have been combined into the columns deep, surf and stra. The dataset also includes the students' age, gender, attitude (a sum of 10 questions measuring attitude towards statistics) and exam points.
Structure and dimension of the dataset:
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
## [1] 166 7
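The combined columns deep, stra and surf are averages over their question groups. A minimal sketch of that step, using hypothetical question-column names (the real wrangling script and its column names are not shown here):

```r
# Hypothetical raw survey answers on a 1-5 Likert scale
set.seed(1)
raw <- data.frame(D03 = sample(1:5, 6, replace = TRUE),
                  D06 = sample(1:5, 6, replace = TRUE),
                  D07 = sample(1:5, 6, replace = TRUE))
# Each learning dimension is the row-wise mean of its question columns
raw$deep <- rowMeans(raw[, c("D03", "D06", "D07")])
raw$deep
```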
Graphical overview of the data and summaries of the variables in the data:
The graphical overview shows that the variables attitude, stra and surf are roughly normally distributed. The age distribution is clearly concentrated on the left. The distributions of deep and points lean towards the right, although points also has a fairly fat tail on the left.
The attitude variable correlates with points in both genders (r ≈ 0.43). Age also correlates with points, but only among males. Surf shows some correlation with attitude, deep and stra among males. All other variable pairs show little or no correlation.
I chose attitude, stra and surf as explanatory variables and fitted a linear regression model with exam points as the target variable.
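The fit itself is one line with `lm()`. Since only the data frame name (learn14) and the column names are known from the output, this sketch uses simulated stand-in data:

```r
# Simulated stand-in for the learn14 data frame (same column names)
set.seed(42)
learn14 <- data.frame(Attitude = rnorm(166, mean = 31, sd = 7),
                      stra = runif(166, 1, 5),
                      surf = runif(166, 1, 5))
learn14$Points <- 11 + 0.34 * learn14$Attitude + rnorm(166, sd = 5)
# Multiple regression with exam points as the target variable
my_model <- lm(Points ~ Attitude + stra + surf, data = learn14)
summary(my_model)
```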
Summary of the fitted model:
##
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = learn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## Attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The t-test and its p-value assess the null hypothesis that the true coefficient is zero. The p-value is the probability of observing an estimate at least this extreme if the true coefficient really were zero, so a very low p-value suggests the coefficient is non-zero, i.e. statistically significant.
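For example, the p-value of the Attitude coefficient above can be recovered from its t statistic as a two-sided tail probability of the t distribution:

```r
# Two-sided p-value for the Attitude coefficient
# (t = 5.913 on 162 degrees of freedom, as in the summary above)
2 * pt(-abs(5.913), df = 162)
# approximately 1.9e-08, matching Pr(>|t|) in the summary
```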
In my model only attitude is statistically significant, so I ran the regression again using only that explanatory variable.
Summary of the new model:
##
## Call:
## lm(formula = Points ~ Attitude, data = learn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The estimate for the attitude variable is ~0.35 and it is statistically significant. According to this model, one additional attitude point is associated with 0.35 more exam points.
The multiple R-squared of this model is 0.19, which means the model explains about 19% of the variation in the dependent variable (exam points). In other words, according to this model, attitude towards statistics explains roughly a fifth of the differences in the Introduction to Social Statistics exam results.
Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage:
First, the assumption that the errors of the model are normally distributed: the Q-Q plot of the residuals shows that this assumption is reasonable. The constant variance assumption implies that the size of the errors should not depend on the explanatory variables; this can be checked with a scatter plot of residuals versus fitted values. This assumption also seems valid, since the scatter plot shows no clear pattern. The third plot shows the leverage of the observations, i.e. how much impact a single observation has on the model. Based on that plot, no single observation has excessive leverage.
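These three diagnostic plots come directly from the base `plot()` method of a fitted lm object; `which = c(1, 2, 5)` picks Residuals vs Fitted, Normal Q-Q and Residuals vs Leverage. Sketched here on a built-in stand-in dataset:

```r
m <- lm(dist ~ speed, data = cars)  # stand-in model on a built-in dataset
par(mfrow = c(1, 3))                # three panels side by side
plot(m, which = c(1, 2, 5))         # the three diagnostic plots
```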
This week's dataset combines two datasets on student achievement in secondary education at two Portuguese schools. The datasets describe student performance in two different subjects, and both include the same background information about the students.
The variable names in this week's dataset:
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
I chose to study the variables age, sex, absences and quality of family relationships (famrel) and their connection with alcohol consumption more closely. My hypothesis is that age and absences correlate positively with alcohol consumption and famrel negatively; in addition, I expect male students to use more alcohol than female students.
The distributions of the chosen variables:
None of my variables is normally distributed. The absences variable shows no obvious pattern. A typical student is between 15 and 18 years old and reports a good quality of family relationships. Gender is roughly equally distributed among the students.
Older students seem to consume more alcohol than younger ones, as I expected.
Students with more absences seem to consume more alcohol than students who do not skip classes, as I expected.
High alcohol use is less common among students who report good-quality family relationships.
There are more male students among the high alcohol consumers according to this graph.
All in all, these plots support my hypotheses.
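The logistic regression can be fitted with `glm()` and `family = "binomial"`. A self-contained sketch on simulated stand-in data with the same column names as the alc data frame (the simulated target is my assumption, chosen only to make the example run):

```r
set.seed(7)
alc <- data.frame(sex = factor(sample(c("F", "M"), 382, replace = TRUE)),
                  age = sample(15:22, 382, replace = TRUE),
                  famrel = sample(1:5, 382, replace = TRUE),
                  absences = rpois(382, 4))
# Simulated binary target, loosely following the hypothesis above
alc$high_use <- runif(382) < plogis(-4 + (alc$sex == "M") + 0.1 * alc$absences)
# Logistic regression: log-odds of high_use as a linear function of the predictors
m <- glm(high_use ~ sex + age + famrel + absences,
         data = alc, family = "binomial")
summary(m)
```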
Summary of the fitted model:
##
## Call:
## glm(formula = high_use ~ sex + age + famrel + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2270 -0.8362 -0.6109 1.0447 2.1507
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.77961 1.74949 -2.160 0.0307 *
## sexM 1.03576 0.24431 4.240 2.24e-05 ***
## age 0.18794 0.10214 1.840 0.0658 .
## famrel -0.30293 0.12784 -2.370 0.0178 *
## absences 0.08890 0.02272 3.913 9.10e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 421.43 on 377 degrees of freedom
## AIC: 431.43
##
## Number of Fisher Scoring iterations: 4
All the variables are statistically significant, although age only at the 0.1 significance level. sexM and absences are the most significant variables. Gender and the quality of family relationships seem to have a fairly large impact on students' high use of alcohol.
The coefficients of the model as odds ratios and confidence intervals for them:
## OR 2.5 % 97.5 %
## (Intercept) 0.02283157 0.0007056798 0.6843933
## sexM 2.81725434 1.7567833577 4.5865176
## age 1.20676457 0.9891582297 1.4778200
## famrel 0.73865126 0.5737167939 0.9488277
## absences 1.09297279 1.0473815343 1.1452177
These results are in line with the hypotheses I stated earlier.
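The odds ratios above are the exponentiated coefficients, and the intervals the exponentiated profile-likelihood confidence intervals. A sketch using a built-in stand-in dataset:

```r
m <- glm(am ~ wt + hp, data = mtcars, family = "binomial")  # stand-in model
OR <- exp(coef(m))     # odds ratios
CI <- exp(confint(m))  # profile-likelihood CIs on the odds-ratio scale
cbind(OR, CI)
```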
## failures absences sex high_use probability prediction
## 373 1 0 M FALSE 0.4050055 FALSE
## 374 1 7 M TRUE 0.4370325 FALSE
## 375 0 1 F FALSE 0.1391479 FALSE
## 376 0 6 F FALSE 0.2544631 FALSE
## 377 1 2 F FALSE 0.1757312 FALSE
## 378 0 2 F FALSE 0.1930123 FALSE
## 379 2 2 F FALSE 0.3047678 FALSE
## 380 0 3 F FALSE 0.3934424 FALSE
## 381 0 4 M TRUE 0.5500634 TRUE
## 382 0 2 M TRUE 0.4025644 FALSE
## prediction
## high_use FALSE TRUE
## FALSE 256 12
## TRUE 80 34
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67015707 0.03141361 0.70157068
## TRUE 0.20942408 0.08900524 0.29842932
## Sum 0.87958115 0.12041885 1.00000000
## [1] 0.2408377
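The number above (≈ 0.24) is the training error, i.e. the share of wrong predictions in the table. A sketch of how such a proportion can be computed; the function name and the 0.5 threshold are my assumptions, not necessarily the script actually used:

```r
# Proportion of predictions on the wrong side of 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5  # TRUE/FALSE targets coerce to 1/0
  mean(n_wrong)
}
loss_func(class = c(TRUE, FALSE, TRUE), prob = c(0.9, 0.2, 0.3))  # → 0.3333333
```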
The dataset is part of the R package MASS and contains housing values in the suburbs of Boston. More details about the variables can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
Structure of the data:
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Dimensions of the data:
## [1] 506 14
Variables rad and tax have the highest correlation (0.91). Other variable pairs with high correlation (absolute value greater than 0.7) are "indus, nox", "indus, tax", "nox, age", "rm, medv", "indus, dis", "nox, dis", "age, dis" and "lstat, medv". The distributions of crim, dis and lstat are clearly concentrated on small values (long right tail). rm and medv are roughly normally distributed, while age and black are concentrated on large values (long left tail).
Summaries of the variables in the data:
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Summaries of the variables in the scaled data:
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
After scaling, all variables have zero mean (and unit variance).
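Standardization subtracts each column's mean and divides by its standard deviation; in R this is a single call to `scale()`:

```r
library(MASS)                                  # provides the Boston data
boston_scaled <- as.data.frame(scale(Boston))  # (x - mean(x)) / sd(x) per column
summary(boston_scaled$crim)                    # mean is now 0
```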
LDA (bi)plot:
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2574257 0.2500000 0.2400990 0.2524752
##
## Group means:
## zn indus chas nox rm
## low 0.9444027 -0.89818627 -0.12090214 -0.8777896 0.46730537
## med_low -0.0930677 -0.35655390 -0.07742312 -0.5922344 -0.11443880
## med_high -0.3882249 0.08086577 0.17414622 0.2978238 0.07805081
## high -0.4872402 1.01710965 -0.11793298 1.0864317 -0.37358515
## age dis rad tax ptratio black
## low -0.8795921 0.8399873 -0.6826072 -0.7453656 -0.4280419 0.3791846
## med_low -0.4221774 0.4117277 -0.5384037 -0.5070395 -0.0489752 0.3155679
## med_high 0.3608703 -0.3086763 -0.4431573 -0.3577176 -0.2108896 0.1047946
## high 0.8085376 -0.8459554 1.6382099 1.5141140 0.7808718 -0.8491276
## lstat medv
## low -0.754222303 0.55294176
## med_low -0.174238992 0.03335807
## med_high 0.005807658 0.14820643
## high 0.838985979 -0.68643330
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08006000 0.69305966 -0.97550180
## indus 0.11442784 -0.16891135 0.31275943
## chas -0.12928575 -0.10570776 0.05060807
## nox 0.39905538 -0.78521198 -1.28526077
## rm -0.16009149 -0.06302664 -0.17573362
## age 0.13537099 -0.39072686 -0.27164268
## dis -0.03887342 -0.30622752 0.22228630
## rad 3.66530248 1.09408405 0.02741219
## tax 0.05086760 -0.13931149 0.52651305
## ptratio 0.10470326 -0.03579711 -0.22860249
## black -0.12236647 0.04573984 0.15173691
## lstat 0.17713001 -0.18804448 0.39326041
## medv 0.20745449 -0.35344309 -0.12003974
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9630 0.0277 0.0094
The variable rad is clearly the most influential linear separator for the clusters.
Cross table of the results with the crime categories from the test set:
## predicted
## correct low med_low med_high high
## low 17 5 1 0
## med_low 5 11 9 0
## med_high 0 3 23 3
## high 0 0 0 25
The model seems to work quite well: most test observations fall on the diagonal of the cross table, and every observation in the high category is classified correctly.
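The cross table comes from predicting on the held-out test set and tabulating against the true classes. A self-contained sketch with `lda()` on a built-in dataset (the actual train/test objects are not shown here):

```r
library(MASS)
set.seed(3)
ind <- sample(nrow(iris), 100)  # random train/test split
train <- iris[ind, ]
test <- iris[-ind, ]
lda.fit <- lda(Species ~ ., data = train)          # fit LDA on the training set
lda.pred <- predict(lda.fit, newdata = test)       # predict the test classes
table(correct = test$Species, predicted = lda.pred$class)
```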
In this part I calculate the distances between the observations, run the k-means algorithm on the data, investigate the optimal number of clusters, and then run the algorithm again with that number.
Based on the total within-cluster sum of squares plot above, I would say that 2 is the optimal number of clusters.
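The elbow search behind that plot can be done by running `kmeans()` for a range of k and collecting the total within-cluster sum of squares:

```r
library(MASS)
data_scaled <- scale(Boston)
set.seed(123)
# Total WCSS for k = 1..10; look for the "elbow" where the drop levels off
twcss <- sapply(1:10, function(k) kmeans(data_scaled, centers = k)$tot.withinss)
plot(1:10, twcss, type = "b", xlab = "number of clusters k", ylab = "total WCSS")
km <- kmeans(data_scaled, centers = 2)  # refit with the chosen k
```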
Visualization of the clusters:
Variables age and rad are the most influential linear separators for the clusters.
## [1] 404 13
## [1] 13 3
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
The distributions of Ado.Birth, GNI, Parli.F and Mat.Mor are clearly concentrated on small values (long right tail). Edu.Exp is roughly normally distributed, while Life.Exp and Labo.FM are concentrated on large values (long left tail).
Life.Exp and Mat.Mor have the highest correlation in absolute value (-0.86). Other variable pairs with absolute correlation greater than 0.7 are "Mat.Mor, Edu.Exp", "Edu.Exp, Life.Exp" and "Mat.Mor, Ado.Birth".
PC1 captures almost 100% of the variance in the unstandardized data, simply because GNI is measured on a much larger scale than the other variables and therefore dominates the principal components.
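This scale effect is the standard argument for standardizing before PCA, and it can be demonstrated with `prcomp()` on any built-in dataset with mixed scales (USArrests here is only a stand-in, not the human data used above):

```r
pca_raw <- prcomp(USArrests)                 # unscaled: Assault dominates PC1
pca_std <- prcomp(USArrests, scale. = TRUE)  # standardize columns first
summary(pca_raw)$importance["Proportion of Variance", "PC1"]
summary(pca_std)$importance["Proportion of Variance", "PC1"]
```

The unscaled PC1 share is far larger than the scaled one, mirroring the two importance tables below.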
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
The tea dataset comes from the FactoMineR package.
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## [1] 300 6
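The MCA below can be reproduced with FactoMineR's `MCA()`, assuming tea_time holds the six selected columns of the tea data:

```r
library(FactoMineR)  # not part of base R; install.packages("FactoMineR") first
data(tea)
# Keep the six columns analyzed above
tea_time <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
mca <- MCA(tea_time, graph = FALSE)  # multiple correspondence analysis
summary(mca)
```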
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |